Extended BPF
A New Type of Software


Brendan Gregg 
50 Years, one (dominant) OS model


Applications 


System Calls 


Kernel 


Hardware 






Origins: Multics, 
1960s 


Supervisor 



Privilege 










Modern Linux: A new OS model 


User-mode 


Kernel-mode 

Applications 


Applications (BPF) 

System Calls 


BPF Helper Calls 

Kernel 

Hardware 













50 Years, one process state model 

































BPF program state model 


Off-CPU On-CPU 


















Netconf 2018 
Alexei Starovoitov



BPF verifier in the future 

• move away from existing brute force "walk all instructions" ap
technology and static analysis 

• remove #define BPF_COMPLEXITY_LIMIT 128k crutch 

• remove #define BPF_MAXINSNS 4k 

• support arbitrary large programs and libraries 

• 1 Million BPF instructions 

• an algorithm to solve Rubik's cube will be expressible in BPF 
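
To make the limits in the bullets above concrete, here is a toy Python sketch of a brute-force "walk all instructions" check. It is not the kernel verifier: the tuple instruction format and the name `verify` are invented for illustration, and the real verifier also tracks register and stack state. Only the two constants come from the slide.

```python
MAX_INSNS = 4096               # mirrors "#define BPF_MAXINSNS 4k" on the slide
COMPLEXITY_LIMIT = 128 * 1024  # mirrors "#define BPF_COMPLEXITY_LIMIT 128k"


def verify(prog):
    """Walk every path of a toy program; return (accepted, detail).

    Hypothetical instruction format (an assumption of this sketch):
      ("op",)          falls through to the next instruction
      ("jmp", jt, jf)  conditional branch to index jt or jf
      ("exit",)        ends the program
    """
    if len(prog) > MAX_INSNS:
        return False, "program too large"
    processed = 0
    pending = [0]  # start indices of paths still to be walked
    while pending:
        pc = pending.pop()
        while True:
            if pc >= len(prog):
                return False, "fell off the end"
            processed += 1
            if processed > COMPLEXITY_LIMIT:
                return False, "complexity limit exceeded"
            insn = prog[pc]
            if insn[0] == "exit":
                break
            if insn[0] == "jmp":
                _, jt, jf = insn
                if jt <= pc or jf <= pc:
                    return False, "backward jump (possible loop)"
                pending.append(jf)  # walk the false branch later
                pc = jt
            else:
                pc += 1
    return True, processed


# Twenty back-to-back branches: 21 instructions, about two million path steps.
diamonds = [("jmp", i + 1, i + 1) for i in range(20)] + [("exit",)]
```

In this toy model `verify(diamonds)` is rejected even though the program is tiny, which is the kind of path blow-up a complexity cap guards against and one reason a smarter analysis is attractive.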




BPF at Facebook 


- ~40 BPF programs active on every server.

- ~100 BPF programs loaded on demand for short period of time.

- Mainly used by daemons that run on every server. 

- Many teams are writing and deploying them. 





Kernel Recipes 2019, Alexei Starovoitov 

~40 active BPF programs on every Facebook server 





NETFLIX 


>150k AWS EC2 Ubuntu server instances 


ubuntu 


~34% US Internet traffic at night 
>130M subscribers 



~14 active BPF programs on every instance (so far) 





Modern Linux: Event-based Applications 



































Modern Linux is becoming Microkernel-ish 


User-mode 

Applications 


Kernel-mode 
Services & Drivers 





Smaller 

Kernel 


Hardware 


The word “microkernel” has already been invoked by Jonathan Corbet, Thomas Graf, Greg Kroah-Hartman,... 





















Steven Rostedt 

@srostedt


BPF will replace Linux #kr2019 


2:06 AM • Sep 26, 2019 • Twitter For Android 


18 Retweets 79 Likes 




BPF 



BPF 1992: Berkeley Packet Filter 


# tcpdump -d host 127.0.0.1 and port 80
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 18
(002) ld       [26]
(003) jeq      #0x7f000001      jt 6    jf 4
(004) ld       [30]
(005) jeq      #0x7f000001      jt 6    jf 18
(006) ldb      [23]
(007) jeq      #0x84            jt 10   jf 8
(008) jeq      #0x6             jt 10   jf 9
(009) jeq      #0x11            jt 10   jf 18
(010) ldh      [20]
(011) jset     #0x1fff          jt 18   jf 12
(012) ldxb     4*([14]&0xf)
(013) ldh      [x + 14]
(014) jeq      #0x50            jt 17   jf 15
(015) ldh      [x + 16]
(016) jeq      #0x50            jt 17   jf 18
(017) ret      #262144
(018) ret      #0




A limited virtual machine for efficient packet filters
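
The listing above is small enough to execute by hand. As a rough illustration of what "limited virtual machine" means, the following Python sketch interprets that same 19-instruction program against a raw Ethernet frame. The opcode labels `ldxb_msh` and `ldh_ind`, the function names, and the `make_packet` fixture are inventions of this sketch, not kernel or libpcap names.

```python
import socket
import struct

# (opcode, operand k, jt, jf), transcribed from the tcpdump -d listing.
# jt/jf are absolute instruction indices, as tcpdump prints them.
PROG = [
    ("ldh", 12, None, None),        # 000 EtherType
    ("jeq", 0x800, 2, 18),          # 001 IPv4?
    ("ld", 26, None, None),         # 002 source address
    ("jeq", 0x7F000001, 6, 4),      # 003 127.0.0.1?
    ("ld", 30, None, None),         # 004 destination address
    ("jeq", 0x7F000001, 6, 18),     # 005 127.0.0.1?
    ("ldb", 23, None, None),        # 006 IP protocol
    ("jeq", 0x84, 10, 8),           # 007 SCTP
    ("jeq", 0x6, 10, 9),            # 008 TCP
    ("jeq", 0x11, 10, 18),          # 009 UDP
    ("ldh", 20, None, None),        # 010 flags + fragment offset
    ("jset", 0x1FFF, 18, 12),       # 011 non-first fragment: reject
    ("ldxb_msh", 14, None, None),   # 012 X = 4 * (pkt[14] & 0xf)
    ("ldh_ind", 14, None, None),    # 013 source port at [x + 14]
    ("jeq", 0x50, 17, 15),          # 014 port 80?
    ("ldh_ind", 16, None, None),    # 015 destination port at [x + 16]
    ("jeq", 0x50, 17, 18),          # 016 port 80?
    ("ret", 262144, None, None),    # 017 accept
    ("ret", 0, None, None),         # 018 reject
]


def run_filter(pkt):
    """Return the number of bytes to capture (0 means drop the packet)."""
    a = x = pc = 0  # accumulator, index register, program counter
    try:
        while True:
            op, k, jt, jf = PROG[pc]
            pc += 1
            if op == "ld":
                a = struct.unpack_from("!I", pkt, k)[0]
            elif op == "ldh":
                a = struct.unpack_from("!H", pkt, k)[0]
            elif op == "ldb":
                a = pkt[k]
            elif op == "ldxb_msh":
                x = 4 * (pkt[k] & 0xF)
            elif op == "ldh_ind":
                a = struct.unpack_from("!H", pkt, x + k)[0]
            elif op == "jeq":
                pc = jt if a == k else jf
            elif op == "jset":
                pc = jt if a & k else jf
            elif op == "ret":
                return k
    except (struct.error, IndexError):
        return 0  # out-of-bounds load: the packet is rejected


def make_packet(src, dst, proto, sport, dport):
    """Test fixture: Ethernet + 20-byte IPv4 header + first bytes of L4."""
    eth = b"\x00" * 12 + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, proto, 0,
                     socket.inet_aton(src), socket.inet_aton(dst))
    l4 = struct.pack("!HH", sport, dport) + b"\x00" * 16
    return eth + ip + l4
```

For example, `run_filter(make_packet("127.0.0.1", "10.0.0.1", 6, 12345, 80))` walks instructions 000 through 017 and returns 262144, while the same packet with destination port 443 ends at instruction 018 and returns 0. There are only forward jumps, two registers, and no calls, which is why the original BPF could be validated and run cheaply.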



BPF 2019: aka extended BPF 



Cilium

Linux Plumbers Conference: BPF microconference

Facebook Katran, Google KRSI, Netflix flowsrus, and many more















BPF 2019


User-Defined BPF Programs 



Device Drivers 


Kernel 









































BPF is now a technology name 
and no longer an acronym 



BPF Internals 


BPF Instructions Events 



Rest of 
Kernel 
Is BPF Turing complete?






















A New Type of Software 



         Execution  User     Compilation  Security       Failure        Resource
         model      defined                              mode           access
User     task       yes      any          user based     abort          syscall, fault
Kernel   task       no       static       none           panic          direct
BPF      event      yes      JIT, CO-RE   verified, JIT  error message  restricted helpers


Example Use Case: 
BPF Observability 



BPF enables a new class of custom, efficient, and production safe performance analysis tools



BPF Perf Tools



BPF Performance Tools



filetop 

filelife fileslower 
vfscount vfsstat


filetype fsrwstat 
vfssize mmapfiles
writesync 

cachestat cachetop 
dcstat dcsnoop 
mountsnoop 

icstat 
bufgrow 
readahead 

writeback 

trace 
argdist 
funccount 
funcslower 
funclatency 
stackcount 
profile 

btrfsdist 
btrfsslower 
ext4dist ext4slower 
nfsslower nfsdist 
xfsslower xfsdist 
zfsslower zfsdist 
overlayfs 

mdflush 

biotop biosnoop 
biolatency 
bitesize 

seeksize 
biopattern 
biostacks 
bioerr 
iosched 
blkthrot 


opensnoop 
statsnoop 
syncsnoop 

ioprofile 


java* node* php* 
python* ruby* 

mysqld_qslower 


javathreads 



Legend: 

prior tool 
new tool 


sockstat sofamily 
soprotocol sormem 
soconnect soaccept 
socketio socksize 
soconnlat so1stbyte
skbdrop skblife 


ieee80211scan 
nvmelatency 

tcptop tcplife tcptracer 
tcpconnect tcpaccept 
tcpconnlat tcpretrans 
tcpsubnet tcpdrop 
tcpstates 
tcpsynbl tcpwin 
tcpnagle tcpreset 
udpconnect 


gethostlatency 
memleak 
sslsniff 

threadsnoop 
pmlock pmheld 

syscount 
killsnoop 

shellsnoop 
signals naptime
eperm setuids 

elfsnoop modsnoop 
execsnoop exitsnoop 
pidpersec 

cpudist cpuwalk 
runqlat runqlen 
runqslower 
cpuunclaimed 
deadlock 

offcputime wakeuptime 
offwaketime softirqs 

offcpuhist threaded 
pidnss mlock mheld 
smpcalls workq 

slabratetop 
oomkill memleak 
shmsnoop drsnoop 

kmem kpages numamove 
mmapsnoop brkstack 
ipecn faults ffaults 

superping fmapfault hfaults 

qdisc-fq vmscan swapin 


hardirqs 
criticalstat 
ttysnoop 


Other: 

capable 

xenhyper 

kvmexits 


llcstat 

cpufreq 











































Ubuntu Install 


ubuntu 


BCC (BPF Compiler Collection): complex tools 
# apt install bcc 


bpftrace: custom tools (Ubuntu 19.04+) 

# apt install bpftrace 


These are default installs at Netflix, Facebook, etc. 




Example: BCC tcplife

Which processes are connecting to which port? 


Julia Evans
@b0rk

i really wish i had a command line tool that would give 
me stats on TCP connection lengths on a given port 

2:48 PM • Aug 16, 2016 • Twitter Web Client 


4 Retweets 23 Likes 




# ./tcplife
PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
22597 recordProg 127.0.0.1       46644 127.0.0.1       28527     0     0 0.23
3277  redis-serv 127.0.0.1       28527 127.0.0.1       46644     0     0 0.28
22598 curl       100.66.3.172    61620 52.205.89.26    80        0     1 91.79
22604 curl       100.66.3.172    44400 52.204.43.121   80        0     1 121.38
22624 recordProg 127.0.0.1       46648 127.0.0.1       28527     0     0 0.22
3277  redis-serv 127.0.0.1       28527 127.0.0.1       46648     0     0 0.27
22647 recordProg 127.0.0.1       46650 127.0.0.1       28527     0     0 0.21
3277  redis-serv 127.0.0.1       28527 127.0.0.1       46650     0     0 0.26
[...]
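
The tweet asked for stats on TCP connection lengths on a given port, and tcplife prints one row per session with the duration in the MS column. A minimal Python sketch of that last step, assuming the nine-column layout shown above (PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS) and command names without spaces; the function name `summarize_by_rport` is made up for this example:

```python
from collections import defaultdict

# Rows copied from the tcplife output above.
TCPLIFE_OUTPUT = """\
PID   COMM       LADDR        LPORT RADDR         RPORT TX_KB RX_KB MS
22597 recordProg 127.0.0.1    46644 127.0.0.1     28527 0 0 0.23
3277  redis-serv 127.0.0.1    28527 127.0.0.1     46644 0 0 0.28
22598 curl       100.66.3.172 61620 52.205.89.26  80    0 1 91.79
22604 curl       100.66.3.172 44400 52.204.43.121 80    0 1 121.38
22624 recordProg 127.0.0.1    46648 127.0.0.1     28527 0 0 0.22
3277  redis-serv 127.0.0.1    28527 127.0.0.1     46648 0 0 0.27
22647 recordProg 127.0.0.1    46650 127.0.0.1     28527 0 0 0.21
3277  redis-serv 127.0.0.1    28527 127.0.0.1     46650 0 0 0.26
"""


def summarize_by_rport(text):
    """Map remote port -> (session count, mean duration ms, max duration ms)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    durations = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        durations[int(row["RPORT"])].append(float(row["MS"]))
    return {
        port: (len(ms), sum(ms) / len(ms), max(ms))
        for port, ms in durations.items()
    }


if __name__ == "__main__":
    for port, (count, mean_ms, max_ms) in sorted(summarize_by_rport(TCPLIFE_OUTPUT).items()):
        print(f"rport {port:<6} sessions {count}  mean {mean_ms:.2f} ms  max {max_ms:.2f} ms")
```

On the sample rows this groups the two curl sessions under remote port 80 and the three recordProg sessions under port 28527.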



Example: BCC tcplife


# ./tcplife -h
usage: tcplife.py [-h] [-T] [-t] [-w] [-s] [-p PID] [-L LOCALPORT]
                  [-D REMOTEPORT]

Trace the lifespan of TCP sessions and summarize

optional arguments:
  -h, --help            show this help message and exit
  -T, --time            include time column on output (HH:MM:SS)
  -t, --timestamp       include timestamp on output (seconds)
  -w, --wide            wide column output (fits IPv6 addresses)
  -s, --csv             comma separated values output
  -p PID, --pid PID     trace this PID only
  -L LOCALPORT, --localport LOCALPORT
                        comma-separated list of local ports to trace.
  -D REMOTEPORT, --remoteport REMOTEPORT
                        comma-separated list of remote ports to trace.

examples:
    ./tcplife           # trace all TCP connect()s
    ./tcplife -t        # include time column (HH:MM:SS)
[...]



Example: BCC biolatency 

What is the distribution of disk I/O latency? Per second? 




# ./biolatency -mT 1 5
Tracing block device I/O... Hit Ctrl-C to end.

06:20:16
     msecs           : count     distribution
         0 -> 1      : 36       |**************************************|
         2 -> 3      : 1        |*                                     |
         4 -> 7      : 3        |***                                   |
         8 -> 15     : 17       |*****************                     |
        16 -> 31     : 33       |**********************************    |
        32 -> 63     : 7        |*******                               |
        64 -> 127    : 6        |******                                |

06:20:17
     msecs           : count     distribution
         0 -> 1      : 96       |************************************  |
         2 -> 3      : 25       |*********                             |
         4 -> 7      : 29       |***********                           |
[...]
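
The histogram above uses power-of-two buckets (0 -> 1, 2 -> 3, 4 -> 7, ...), which keeps the summary small however wide the latency range is. A rough Python sketch of that bucketing and of the bar rendering follows; it only imitates the output format, the helper names are hypothetical, and in the real tool the counting happens in kernel context rather than in Python.

```python
def log2_bucket(value):
    """Index of the power-of-two bucket holding value (0 and 1 share bucket 0)."""
    return max(value.bit_length() - 1, 0)


def bucket_range(index):
    """Inclusive (low, high) bounds of a bucket, e.g. index 2 -> (4, 7)."""
    if index == 0:
        return 0, 1
    return 1 << index, (1 << (index + 1)) - 1


def render(values, width=38):
    """Return histogram lines for a non-empty list of non-negative integers."""
    counts = {}
    for v in values:
        i = log2_bucket(v)
        counts[i] = counts.get(i, 0) + 1
    top = max(counts.values())
    lines = []
    for i in range(max(counts) + 1):
        low, high = bucket_range(i)
        n = counts.get(i, 0)
        stars = "*" * (n * width // top)  # bar length scaled to the fullest bucket
        lines.append(f"{low:>6} -> {high:<6} : {n:<6} |{stars:<{width}}|")
    return lines


if __name__ == "__main__":
    # Made-up latencies in milliseconds, purely for demonstration.
    for line in render([0, 1, 1, 2, 5, 5, 9, 20, 21, 70]):
        print(line)
```

Only one counter per bucket has to be kept and copied to user space, which is part of why this style of summary is cheap enough for production use.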






BCC/BPF: biolatency

[Figure: biolatency heat map for 100.66.98.191:7402. Y-axis: power-of-two latency buckets from 0-1 up to 524288-1048575. X-axis: time, 11:26:20 to 11:28:00. Tooltip: 256-511: 805.]















Example: bpftrace readahead 

Is readahead polluting the cache? 




# readahead.bt
Attaching 5 probes...
^C
Readahead unused pages: 128

Readahead used page age (ms):
@age_ms:
[1]                 2455 |@@@@@@@@@@@@@@@                                     |
[2, 4)              8424 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)              4417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8, 16)             7680 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[16, 32)            4352 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)           384 |@@                                                  |





#!/usr/local/bin/bpftrace

kprobe:__do_page_cache_readahead    { @in_readahead[tid] = 1; }
kretprobe:__do_page_cache_readahead { @in_readahead[tid] = 0; }

kretprobe:__page_cache_alloc
/@in_readahead[tid]/
{
	@birth[retval] = nsecs;
	@rapages++;
}

kprobe:mark_page_accessed
/@birth[arg0]/
{
	@age_ms = hist((nsecs - @birth[arg0]) / 1000000);
	delete(@birth[arg0]);
	@rapages--;
}

END
{
	printf("\nReadahead unused pages: %d\n", @rapages);
	printf("\nReadahead used page age (ms):\n");
	print(@age_ms); clear(@age_ms);
	clear(@birth); clear(@in_readahead); clear(@rapages);
}
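
The script's bookkeeping is easier to follow when replayed outside the kernel. The Python sketch below mirrors its three maps and one counter over a made-up event stream; the event names, the tuple layout, and the function `replay` are assumptions of this illustration, not part of bpftrace.

```python
def replay(events):
    """Replay (timestamp_ns, probe, tid, page) events in time order.

    Returns (unused readahead pages, list of used-page ages in ms).
    """
    in_readahead = {}  # plays the role of @in_readahead[tid]
    birth = {}         # plays the role of @birth[page address]
    rapages = 0        # plays the role of @rapages
    ages_ms = []       # values the script would feed to hist()
    for ts, probe, tid, page in events:
        if probe == "readahead_enter":
            in_readahead[tid] = 1
        elif probe == "readahead_return":
            in_readahead[tid] = 0
        elif probe == "page_alloc_return" and in_readahead.get(tid):
            # Page allocated while this thread was inside readahead.
            birth[page] = ts
            rapages += 1
        elif probe == "mark_page_accessed" and page in birth:
            # First access to a readahead page: record its age, stop tracking.
            ages_ms.append((ts - birth.pop(page)) // 1_000_000)
            rapages -= 1
    return rapages, ages_ms


# Hypothetical trace: thread 7 reads ahead two pages; only one is used later.
EVENTS = [
    (0, "readahead_enter", 7, None),
    (1_000_000, "page_alloc_return", 7, 0xA000),
    (2_000_000, "page_alloc_return", 7, 0xB000),
    (3_000_000, "readahead_return", 7, None),
    (4_000_000, "page_alloc_return", 7, 0xC000),   # outside readahead: ignored
    (9_000_000, "mark_page_accessed", 9, 0xA000),  # used after 8 ms
    (9_500_000, "mark_page_accessed", 9, 0xC000),  # never tracked
]
```

Pages still counted when tracing stops are the "unused" ones, so a large leftover count relative to the histogram totals would suggest readahead is polluting the cache.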



Observability Challenges 


libc: no frame pointer

JIT: function tracing

Broken off-CPU flame graph (no frame pointer)

[Figure: off-CPU flame graph for mysqld. Kernel frames run from entry_SYSCALL_64_after_hwframe through do_syscall_64, SyS_futex, do_futex, futex_wait, futex_wait_queue_me and schedule to finish_task_switch; below them, pthread_cond_wait@@GLIBC and pthread_cond_timedwait rest on [unknown] frames.]















































Reality Check 


Many of our perf wins are from CPU flame graphs, not CLI tracing


CPU Flame Graphs

[Figure: flame graph. Y-axis: stack depth (0 - max). X-axis: alphabetical frame sort (A - Z).]
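
The two axes can be reproduced from raw stack samples in a few lines. The Python sketch below folds samples into "frame;frame;frame" keys with counts (the folded format used by flame graph tooling) and then computes, for one stack depth, the width of each frame in alphabetical order. The function names and the sample data are invented for illustration.

```python
from collections import Counter


def fold(samples):
    """Count identical stacks; each sample is a list of frames, root first."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded, depth):
    """Sample count of each frame at the given depth, sorted A-Z.

    A frame is keyed by its full ancestry so that the same function name
    under different parents is drawn as separate boxes.
    """
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        if depth < len(frames):
            widths[";".join(frames[: depth + 1])] += count
    return sorted(widths.items())


# Hypothetical samples from a profiler.
SAMPLES = [
    ["main", "a", "x"],
    ["main", "a", "x"],
    ["main", "b"],
    ["main", "a", "y"],
]
```

With these samples, depth 0 is a single `main` box of width 4 and depth 1 splits into `main;a` (3) and `main;b` (1). Sorting by name rather than by time is what lets identical frames merge into wide boxes.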







































BPF-based CPU Flame Graphs 


Linux 2.6 Linux 4.9 




















Observability of BPF 




bpftool

perf

bpflist

# bpftool perf
pid 1765 fd 6: prog_id 26 kprobe func blk_account_io_start offset 0
pid 1765 fd 8: prog_id 27 kprobe func blk_account_io_done offset 0
pid 1765 fd 11: prog_id 28 kprobe func sched_fork offset 0
pid 1765 fd 15: prog_id 29 kprobe func ttwu_do_wakeup offset 0
pid 1765 fd 17: prog_id 30 kprobe func wake_up_new_task offset 0
pid 1765 fd 19: prog_id 31 kprobe func finish_task_switch offset 0
pid 1765 fd 26: prog_id 33 tracepoint inet_sock_set_state
pid 21993 fd 6: prog_id 232 uprobe filename /proc/self/exe offset 1781927
pid 21993 fd 8: prog_id 233 uprobe filename /proc/self/exe offset 1781920
pid 21993 fd 15: prog_id 234 kprobe func blk_account_io_done offset 0
pid 21993 fd 17: prog_id 235 kprobe func blk_account_io_start offset 0
pid 25440 fd 8: prog_id 262 kprobe func blk_mq_start_request offset 0
pid 25440 fd 10: prog_id 263 kprobe func blk_account_io_done offset 0
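
Output in this shape is easy to post-process when the question is "which process has attached what?". A small Python sketch, assuming the line format shown above; the regular expression and the function name `summarize` are this example's own and are not a bpftool interface:

```python
import re
from collections import Counter

# Lines copied from the bpftool perf output above.
BPFTOOL_PERF = """\
pid 1765 fd 6: prog_id 26 kprobe func blk_account_io_start offset 0
pid 1765 fd 8: prog_id 27 kprobe func blk_account_io_done offset 0
pid 1765 fd 11: prog_id 28 kprobe func sched_fork offset 0
pid 1765 fd 15: prog_id 29 kprobe func ttwu_do_wakeup offset 0
pid 1765 fd 17: prog_id 30 kprobe func wake_up_new_task offset 0
pid 1765 fd 19: prog_id 31 kprobe func finish_task_switch offset 0
pid 1765 fd 26: prog_id 33 tracepoint inet_sock_set_state
pid 21993 fd 6: prog_id 232 uprobe filename /proc/self/exe offset 1781927
pid 21993 fd 8: prog_id 233 uprobe filename /proc/self/exe offset 1781920
pid 21993 fd 15: prog_id 234 kprobe func blk_account_io_done offset 0
pid 21993 fd 17: prog_id 235 kprobe func blk_account_io_start offset 0
pid 25440 fd 8: prog_id 262 kprobe func blk_mq_start_request offset 0
pid 25440 fd 10: prog_id 263 kprobe func blk_account_io_done offset 0
"""

LINE = re.compile(r"pid\s+(\d+)\s+fd\s+(\d+):\s+prog_id\s+(\d+)\s+(\w+)\s+(.*)")


def summarize(text):
    """Return (programs per PID, programs per attach type) as Counters."""
    per_pid = Counter()
    per_type = Counter()
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip blank or unexpected lines
        pid, _fd, _prog_id, attach_type, _target = match.groups()
        per_pid[int(pid)] += 1
        per_type[attach_type] += 1
    return per_pid, per_type
```

For the sample above this reports seven programs for PID 1765, four for PID 21993 and two for PID 25440, most of them kprobes.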





# bpftool prog dump jited id 263
int trace_req_done(struct pt_regs * ctx):
0xffffffffc082dc6f:
; struct request *req = ctx->di;
   0:   push   %rbp
   1:   mov    %rsp,%rbp
   4:   sub    $0x38,%rsp
   b:   sub    $0x28,%rbp
   f:   mov    %rbx,0x0(%rbp)
  13:   mov    %r13,0x8(%rbp)
  17:   mov    %r14,0x10(%rbp)
  1b:   mov    %r15,0x18(%rbp)
  1f:   xor    %eax,%eax
  21:   mov    %rax,0x20(%rbp)
  25:   mov    0x70(%rdi),%rdi
; struct request *req = ctx->di;
  29:   mov    %rdi,-0x8(%rbp)
; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
  2d:   movabs $0xffff96e680ab0000,%rdi
  37:   mov    %rbp,%rsi
  3a:   add    $0xfffffffffffffff8,%rsi
; tsp = bpf_map_lookup_elem((void *)bpf_pseudo_fd(1, -1), &req);
  3e:   callq  0xffffffffc39a49c1





LPC 2019, Arnaldo Carvalho de Melo
CPU profiling of BPF programs





































“We should be able to single-step execution.
We should be able to take a core dump of all state.”

- David S. Miller, LSFMM 2019


[Photo: UNIVAC 1 panel with single-step controls labeled ONE OPERATION, ONE STEP, ONE ADDITION, ONE INSTRUCTION]





Future 



Future Predictions 

More device drivers, incl. USB on BPF (gkh)

Monitoring agents 

Intrusion detection systems 

TCP congestion controls 

CPU & container schedulers 

FS readahead policies 

CDN accelerator 



Take Aways 


BPF is a new software type 

Start using BPF perf tools on Ubuntu 

bcc, bpftrace 


Thanks 


NETFLIX 


BPF: Alexei Starovoitov, Daniel Borkmann, David S. Miller, Linus Torvalds, BPF community
BCC: Brenden Blanco, Yonghong Song, Sasha Goldshtein, BCC community
bpftrace: Alastair Robertson, Matheus Marchini, Dan Xu, bpftrace community
Canonical: BPF support, and libc-fp (thanks in advance)

All photos credit myself; except slide 2 (Netflix) and 9 (KernelRecipes) 



